Multidimensional biodata integration and relationship inference

ABSTRACT

This invention provides an advanced platform for the analysis of biological data that emphasizes pathway mapping and relationship inference based upon data acquired from multiple diverse sources (102). The platform employs a bioinformatic system (100) that integrates data from the diverse sources (102), connecting related genes and proteins and inferring biological functions (108) in the context of global cellular processes.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims priority under 35 U.S.C. 119(e) from U.S. Ser. No. 60/298,689, filed Jun. 14, 2001, the disclosure of which is incorporated by reference herein.

BACKGROUND OF THE INVENTION

[0002] This invention relates to systems for the collection and manipulation of biodata from diverse sources and the processing of such biodata to identify potential therapeutic targets.

[0003] The human genome project, with its goal of complete genome sequencing, has examined three gigabytes of human genomic DNA and predicts that approximately 30,000 genes are resident in the human genome. However, identification and sequencing of a gene are but the first steps in its characterization. The challenge is to determine the function of the gene as well as its relationship to other genes. With this information, directed experimentation to identify genes that are likely targets for therapeutic intervention becomes feasible and, ultimately, the drug discovery timeline will be shortened.

[0004] Genes contain genetic information that is transcribed into messenger RNA and then translated into protein. Proteins play a critical role in cellular processes. Functional proteomics seeks to identify a protein's function and related pathway roles through large-scale, high-throughput experiments. Protein functional analysis systematically determines protein-protein interactions. Protein interactions mediate cellular signaling cascades that are not typically linear, but are more likely represented by a complex branched network. When unknown proteins interact with previously characterized proteins, information about their function and role in the same or related cellular process may be obtained.

[0005] Most commercially available bioinformatics systems perform functional analysis using a single information source such as a traditional relational database optimized for transactional database processing. Such systems do not integrate collections of data from various sources. Conversely, an intelligent system that integrates data derived from multiple sources would allow for the integration of data from various operational databases and, thus, enhance research efforts which focus on specific therapeutic targets.

SUMMARY OF THE INVENTION

[0006] This invention provides an advanced platform for the analysis of biological data that relies upon pathway mapping and relationship inferences drawn from data acquired from multiple diverse sources. The platform employs a bioinformatic system that integrates data from the diverse sources, connecting related genes and proteins and inferring biological functions in the context of global cellular processes.

[0007] The invention may be conceptually understood as having four primary components: data collection, data integration, data analysis/relationship inference and inference presentation.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1. Data analysis and data warehouse system. Data from various sources is input into the operational relational database system (RDBMS), extracted and cleaned, then loaded into the data warehouse system, which organizes the data and infers relationships among the stored data, predicting missing attribute values for incoming data.

[0009] FIG. 2. Sample data view in a multidimensional database.

[0010] FIG. 3. Analysis of two characteristic matrices. The left panel matrix has the same cardinality in both dimensions. The matrix contains each gene's binding patterns. Domain analysis of each of the 391 genes resulted in a total of 849 domains as shown in the right panel. The negative normalized logarithm of the E-value was concatenated to the first matrix as shown in the right panel.

[0011] FIG. 4. A protein-protein interaction network in a human cell probed by the YTH system. Each gene is represented by a gray dot, and the edges connecting two genes represent specific interactions between the genes. For the apoptosis example, green represents the genes from the YTH matrix analysis, and blue represents additional genes from analyses of both YTH and domain information. As the network demonstrates, TNF-induced apoptosis pathway members (blue and green) are scattered in the YTH network, whereas domain information links them together.

[0012] FIG. 5. Genes identified in the apoptosis pathway of the network are represented in a hierarchical tree format. The genes closer to each other on a branch of the tree have a higher correlation based on specific interactions in the YTH system.

[0013] FIG. 6. Genes identified from the clustering analysis based on YTH and domain data. Genes in green are identified by YTH data; genes in blue are newly identified when domain data is added.

[0014] FIG. 7. Flowchart outlining the system of this invention.

DETAILED DESCRIPTION OF THE INVENTION

[0015] Embodiments of the present invention provide techniques for identifying potential therapeutic targets. In order to identify potential therapeutic targets, an embodiment of the present invention determines characteristics of genes and proteins based upon information available from various public and private information sources. FIG. 7 is a simplified high-level flowchart 100 depicting a method of determining characteristics of genes and/or proteins based upon information available from various public and private information sources according to an embodiment of the present invention. The method may be performed by a data processing system that may comprise a memory subsystem and one or more processors. For example, the processing depicted in flowchart 100 may be performed by software modules that are stored by the memory subsystem of the data processing system and are executed by one or more processors of the data processing system. The processing may also be performed by hardware modules coupled to the data processing system, or by a combination of software modules and hardware modules. It should be understood that flowchart 100 is merely illustrative of an embodiment incorporating the present invention and does not limit the scope of the invention as recited in the claims. One of ordinary skill in the art would recognize variations, modifications, and alternatives.

[0016] As depicted in flowchart 100, information from various information sources is gathered (step 102). As described below in further detail, the information sources may include public and private information sources (e.g., publicly accessible gene databases such as GenBank, databases storing micro-array information such as the Stanford Micro-array Depository, databases that store published information such as Medline, private information sources, experiments, and the like). The information may be collected manually or in an automated manner.

[0017] The information collected in step 102 is then stored in a format that facilitates analysis of the information (step 104). As described below in further detail, the information may be stored in one or more databases that are integrated using a data warehouse-based infrastructure. According to an embodiment of the present invention, the information is also stored or represented in the form of characteristic matrices. Each characteristic matrix may store and represent information for a particular dimension of data. For example, a first characteristic matrix may store information related to functional assays, a second characteristic matrix may store information related to protein-protein interactions, a third characteristic matrix may store information related to ontology mappings, a fourth characteristic matrix may store information related to fold recognition, and so on. Further information on the different characteristic matrices that may be used is described below.

[0018] The information stored in step 104 is then analyzed (step 106). Various different analysis techniques may be used. For example, according to an embodiment of the present invention, multivariate analysis techniques including clustering analysis techniques are used to analyze the information stored in step 104. According to an embodiment of the present invention, the information that is stored in databases and information that is represented by the characteristic matrices is analyzed. The characteristic matrices facilitate multidimensional analysis of the data.

[0019] Inferences, deductions, and/or conclusions are then drawn based upon the results of the analysis performed in step 106 (step 108). For example, according to an embodiment of the present invention, clustering analysis yields clusters of genes and proteins that are correlated with one another. Inferences can then be drawn from the clusters formed as a result of the clustering analysis. For example, genes and proteins with similar profiles based upon various biological experiments/observations that are clustered together are more likely to have similar cellular functions and be involved in the same biological pathways. As a result, characteristics of novel (or previously unknown) genes and functions of novel proteins can be inferred based upon characteristics and functions of the known genes or proteins in the same cluster. In this manner, various inferences, deductions, and conclusions can be drawn from the results of the analysis performed in step 106.

[0020] The inferences, deductions, and/or conclusions drawn in step 108 may then be output to a user (step 110). Various different techniques may be used to output the results to the user. For example, according to an embodiment of the present invention, the information may be output to the user in response to a query received from the user. Various different user interfaces may be used to output the information to the user.

[0021] The processing performed in each step of flowchart 100 is described below in further detail.

[0022] The four primary components of the multidimensional biodata integration and relationship inference platform are described in detail below.

[0023] a) Definitions

[0024] The following definitions are set forth to illustrate and define the meaning and scope of the various terms used to describe the invention.

[0025] The term “therapeutic target” means any environment or molecule (often a gene or a protein) that is instrumental to a disease process, though not necessarily directly involved, with the intention of finding a way to regulate that environment's or molecule's activity for therapeutic purposes.

[0026] By the term “biodata” is meant, for the purpose of the specification and claims, any biological data compiled in one or more database(s) and/or data warehouse(s) including, but not limited to, biological data related to molecular pathways, cellular processes, protein-protein interaction, protein structure, genetics, molecular biology, expression arrays, functional assays, and genomes.

[0027] The term “data warehouse” refers to a repository where data from multiple databases is brought together for more complex analysis. It is also a physical repository where relational data are specially organized to provide enterprise-wide, cleaned data in a standardized format.

[0028] The term “data mart” means a subset of a data warehouse where data relevant to a particular query is stored.

[0029] The term “archival biological data” means biological data that has been archived or compiled in one or more database(s) and/or data warehouse(s) and can be accessed by a user. Examples of archival biological data are domain analyses, ontology vocabulary mapping, fold recognition, gene sequences, expressed sequence tags (ESTs), single nucleotide polymorphisms (SNPs), biochemical functions, physiological roles and structure/function relationships.

[0030] A “multidimensional database” as used herein, is a database in which data is organized and summarized in multiple dimensions for easier comprehension. By performing queries, users can create customized slices of data by combining various fields or dimensions.

[0031] The term “compound library” refers to a large collection of compounds with different chemical properties or shapes, generated either by combinatorial chemistry or some other process or by collecting samples with interesting biological properties. This compound library can be screened for drug targets. For example, a lead compound is a potential drug candidate emerging from a screening process of a large library of compounds.

[0032] “High-throughput screening (HTS)” refers to rapid in vitro screening of large numbers of compound libraries (generally tens to thousands of compounds), using robotic screening assays.

[0033] A “protein array” or “protein microarray” refers to a multi-spot, metallic or polymeric device with surface chemistries used for affinity capture of proteins from complex biological samples. A “protein array analysis” refers to the use of a protein array in order to evaluate potential polypeptide(s) of interest. For example, using a protein microarray made up of several hundred antibodies it is possible to monitor alterations of protein levels in specific cells treated with various agents.

[0034] A “nucleic acid array” or “DNA microarray” as used herein means a device (e.g., glass slide) for studying how large numbers of nucleic acids (e.g., cDNA, genomic DNA, RNA, mRNA, siRNA, etc.) interact with each other and how a cell controls vast numbers of nucleic acids simultaneously. Tiny droplets containing nucleic acids are applied to slides and fluorescently labeled probes are allowed to bind to the complementary nucleic acid strands on the slides. The slides are scanned and the brightness of each fluorescent dot is measured. The brightness of the dot reveals the presence and quantity of a specific nucleic acid. A “nucleic acid array analysis” employs such arrays for the analysis of nucleic acid(s) of interest.

[0035] The term “functional gene screening” refers to the use of biochemical assays in order to screen for a specific protein, which indicates that a specific gene is not merely present but active.

[0036] The term “expressed sequence tag (EST)” refers to a nucleic acid sequence made from cDNA which comprises a small part of a gene. An EST can be used to detect the gene by hybridizing the EST with part of the gene. The EST can be radioactively labeled in order to locate it in a larger segment of DNA. A “single nucleotide polymorphism (SNP)” refers to changes in a single base pair of a particular gene happening simultaneously in a population.

[0037] The term “structure/function relationship” refers to the relationship between the structure and organization of the gene and the function of the gene as it directs growth, development, physiological activities, and other life processes of the organism. It also refers to the structure and organization of the protein and the function of the protein as a cellular building block and/or participant in cellular processes and pathways.

[0038] The term “cleaned” means, for the purpose of the specification and claims, the process of making data that is being imported into a data warehouse more accurate by removing mistakes and inconsistencies.

[0039] The term “time variant data” refers to data whose accuracy is relevant to any one moment in time. Thus, the term “time variant database” refers to a database that contains, but is not limited to, such time variant data.

[0040] “Cluster analysis”, as used herein, refers to clustering, or grouping, of large data sets (e.g., biological data sets) on the basis of similarity criteria for appropriately scaled variables that represent the data of interest. Similarity criteria (distance based, associative, correlative, probabilistic) among the several clusters facilitate the recognition of patterns and reveal otherwise hidden structures.

[0041] An “unsupervised analysis” refers to a data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. An “unsupervised clustering analysis” is an example of an unsupervised analysis.

[0042] The term “algorithm” means a procedure used to solve a mathematical or computational problem or to address a data processing issue. In the latter sense, an algorithm is a set of step-by-step commands or instructions designed to reach a particular goal or a step-by-step search, where improvement is made in every step until the best solution is found.

[0043] A “neural net” or “artificial neural network” refers to computer technology that operates like a human brain, such that computers possess simultaneous memory storage and work with ambiguous information.

[0044] A “relational database” refers to a database in which data is stored in multiple tables. These tables then “relate” to one another to make up the entire database. Queries can be run to “join” these related tables together. An “operational database” comprises system-specific reference data and event data belonging to a transaction-update system. It may also contain system control data such as indicators, flags, and counters. The operational database is the source of data for the data warehouse. The data continually changes as updates are made, and reflects the current value of the last transaction.

[0045] b) Data Collection

[0046] Data may be obtained from both experimental and archival sources. The data sources are diverse and constitute the basic input for the system. Representative data may be obtained, inter alia, from gene functional assays, protein-protein interaction studies (e.g. yeast two-hybrid screening, proteomic chip or chromatographic analyses), ontology vocabulary mapping, fold recognition, nucleic acid array expression data, gene domain analysis, and proprietary, published or otherwise publicly accessible archived gene and protein sequences, ESTs, structure-function relationships, and related chemical, clinical and physical data. In a given embodiment, any number of data sources may be called upon to invoke the platform.

[0047] Functional assays. Cell cycle functional screening assays may be used to identify small fragments of genes or peptides that cause arrest at different cell cycle phases. Information associated with each assay includes experimental protocols, reagents, and raw data such as electrophoresis gel images.

[0048] Protein-protein interactions. To further characterize proteins demonstrating a desired functional assay phenotype, proteins interacting with each other may be identified through a variety of techniques, including, but not limited to, yeast two-hybrid (YTH) screening and proteomic chip and chromatographic analyses.

[0049] The yeast-based two-hybrid (YTH) system is a common method for large-scale experimental detection of protein-protein interactions. In YTH systems, the protein of interest (the bait) is fused to a fragment of a known DNA-binding protein such as GAL4, anchoring the bait to a colorimetric reporter gene. Potential interacting proteins (screened members expressed from a cDNA library) are then attached to a cognate transcriptional activating protein that can activate the reporter gene, producing an easily monitored color change in yeast cells. This color change results from the direct physical interaction between the proteins.

[0050] Mapping protein interactions through protein chip patterns is an alternative high-throughput methodology. In addition, 2D polyacrylamide gel electrophoresis (2D GEL) coupled with mass spectroscopy now provides proteomic fingerprints, digestion patterns of protein complexes in different cellular states.

[0051] A LIMS (Laboratory Information Management System) is used to track the large-scale cloning process, sequencing, and downstream data generated from different functional assays, yeast two-hybrid screening, and proteomics experiments. Functional screens include, but are not limited to, cell cycle regulation, angiogenesis, T cell activation, B cell activation, IgE class switch, etc.

[0052] For example, the goal of the cell cycle functional assay is to identify genes that cause arrest at different cell cycle phases. The identified genes are then evaluated as targets for therapeutic intervention to slow tumor growth or tumorigenesis. The process may be described as follows. Cell tracker dye is applied to tumor cells that host different cDNA fragments. After several days' growth, the slowly replicating cells are sorted out using FACS (fluorescence-activated cell sorter). The cDNA that confers this phenotype is then extracted using RT-PCR (reverse transcriptase-polymerase chain reaction). Information associated with the whole process includes experimental protocols, reagents, gene sequences, and raw data (such as histograms or dot plots generated by FACS). Raw data results from each functional assay are then transferred to the data warehouse for storage.

[0053] LIMS is a task-based workflow system implemented in Java technology with a generic schema and open architecture. It supports various workflow patterns (fork, option, loop, merge), two types of nodes (entity node, bridge node) and exit/entry conditions. Tasks are assigned based on roles or inherited from the parent tasks. A message system is used to send notices to users for pipeline modification. Security management, transaction management and resource management are also implemented in the system. The scheduler can schedule certain tasks based on one-time execution or repeat execution. XML technology is used generally for the workflow descriptor and object distribution. The workflow deployment tool deploys the specific pipeline based on the descriptor. The front end of this system is browser based.

[0054] Data from public sources are gathered automatically by a web crawler (written in the Perl scripting language) or by periodically dispatched UNIX processes. Various parsers are developed to extract the information we need for downstream data warehouse storage. These data sources include: ontology classification, co-regulation value based on expression array, public sequences (EST, protein and nucleotide), and individual genomes from various species as described in the following.

[0055] We use ncftp, scheduled by a Unix cron job running every week, to fetch the EST, protein and nucleotide data from NCBI at ftp://ftp.ncbi.nih.gov/blast/db. The raw data is stored on server disk. An LWP-based Perl script takes three URLs (http://www.tigr.org, http://genome.ucsc.edu, http://www.fruitfly.org/) as its input and the output is parsed and stored as a text file. The following data sources are generated based on computation applied to the original data sources from the above: domain analysis, fold recognition, and protein links based on multiple genomes.

[0056] Published literature may also be used, either by manually extracting information or by using natural language processing (NLP), for incorporation into the database.

[0057] Ontology vocabulary mapping. An ontology provides a formal written description of a specific set of concepts and their relationships in a particular domain (P. D. Karp, An Ontology for Biological Function Based on Molecular Interactions, Bioinformatics, vol. 16, 2000, pp. 269-285). One ontology is based upon Stanford University's Gene Ontology (GO) Consortium (www.geneontology.org). The LWP-based Perl script uses HTTP GET to fetch the three ontology files (component.ontology, process.ontology, function.ontology) from the ftp site (ftp://ftp.geneontology.org/pub/go/ontology/). The files are stored on the server disk and are parsed by the matrix generation program.

[0058] GO has three categories: molecular function, biological process, and cellular component. A gene product has one or more molecular functions and participates in one or more biological processes. The gene product might be a cellular component or it might be associated with one or more such components. Each element's ontology is represented on an acyclic directed graph. The nodes at the upper branches have more general characteristics, while end nodes have relatively specific attributes, including inheritance of parental characteristics.

[0059] Fold recognition. Although there are several hundred thousand proteins in the nonredundant protein database at the US National Center for Biotechnology Information (NCBI), it is estimated that there are only about 5,000 unique native 3D structures or folds. Most frequently occurring folds were determined experimentally. PROSPECT, a threading package developed at Oak Ridge National Lab, was used to predict novel gene folds (Y. Xu and D. Xu, Protein Threading Using Prospect: Design and Evaluation, Proteins, vol. 40, 2000, pp. 343-354). Threading searches use structure templates to find a query's best fit. PROSPECT has three components:

[0060] Libraries of representative 3D protein structures for use as templates, including protein chain (2,177 templates defined by the families of structurally similar proteins [FSSP] nonredundant set) and compact domains (771 domains defined by the distance-matrix-alignment [DALI] nonredundant domain library).

[0061] A knowledge-based energy function describing the fitness between the query sequence and potential templates.

[0062] A “divide-and-conquer” threading algorithm that searches for the lowest energy match among the possible alignments of a given query-template pair. The algorithm first aligns elements of the query sequence and the template, and then merges the partial results to form an optimal global alignment.

[0063] A neural network derives a criterion to estimate the predicted structure's confidence level. Typically, the criterion selects about five statistically significant hits for a query protein.

[0064] Coregulation from array expression. Nucleic acid arrays hybridize labeled RNA or DNA in solution to nucleic acid molecules attached at specific locations on high-density array surfaces. Hybridization of a sample to an array is a highly parallel search allowing complex mixtures of RNA and DNA to be interrogated in a high throughput and quantitative fashion. DNA arrays can be used for many different purposes, but predominantly they measure levels of gene expression (messenger RNA abundance) for tens of thousands of genes simultaneously (D. J. Lockhart and E. A. Winzeler, Genomics, Gene Expression and DNA Arrays, Nature, vol. 405, 2000, pp. 827-836). Chips with hundreds or thousands of oligonucleotide sequences representing partial gene sequences may be constructed. Hybridizing mRNA derived from different samples, for example, cancerous versus normal tissue, provides information about gene expression under different cellular conditions. Gene function may be inferred by correlating differential mRNA expression patterns.

[0065] If a gene has no previous functional assignments, one can give it a tentative assignment or a role in a biological process based on the known functions of genes in the same expression cluster (the “guilt-by-association” concept). This is possible because genes with similar expression behavior (for example, parallel increases and decreases under similar circumstances) tend to be related functionally.
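By way of illustration only, the “guilt-by-association” concept may be sketched as follows. The sketch assumes Pearson correlation as the similarity measure and a correlation threshold of 0.9; neither is specified above, and the gene names, profiles and function labels are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-regulation signal
    return cov / (sd_x * sd_y)

def tentative_function(query_profile, annotated, threshold=0.9):
    """Return the function of the best-correlated annotated gene, or None.

    `annotated` maps gene name -> (expression profile, known function).
    The threshold is an assumed value, not one taken from the text.
    """
    best_function, best_r = None, threshold
    for profile, function in annotated.values():
        r = pearson(query_profile, profile)
        if r >= best_r:
            best_function, best_r = function, r
    return best_function

# Hypothetical profiles across four experimental conditions.
annotated = {
    "geneA": ([1.0, 2.0, 3.0, 4.0], "apoptosis"),
    "geneB": ([4.0, 3.0, 2.0, 1.0], "cell cycle"),
}
novel_gene = [0.9, 2.1, 2.9, 4.2]
print(tentative_function(novel_gene, annotated))
```

The assignment remains tentative: a high correlation suggests, but does not establish, a shared function.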

[0066] Collaboratively, the National Cancer Institute (NCI) and Stanford University have tested the expression of 8,000 unique genes in 60 cell lines used in NCI's anticancer drug screening. The Stanford microarray depository website:

[0067] (http://genome-www5.stanford.edu/cgi-bin/SMD/listMicroArrayData.pl?tableName=publication&5306)

[0068] may be used as a data source. The program is implemented using Perl LWP as a Unix terminal command-based script that has been scheduled to run every month for data update. The input of the program is the Stanford microarray database URL and the output is the parsed ASCII-text based flat file. The program uses the GET method from the HTTP protocol to fetch the HTML-format data from the above web site, then parses the data into an ASCII-text based flat file and uses the hard disk as its secondary persistent data storage.

[0069] Domain analysis. Gene domain analysis is based on a hidden Markov model search (hmmsearch, HMMER 2.0 suite) against the Pfam model set of 2,773 domains downloaded from http://pfam.wustl.edu with an E-value cutoff of 0.0005. A multiple-thread Perl program is implemented for the domain analysis. It takes a FASTA format protein file as the input and invokes a system call to trigger the Pfam search for each protein sequence. It then parses the raw output to a hash data structure which stores domain name, E-value, and alignment position/gap for each protein. The hash data structure is persisted to the Unix file system.
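By way of illustration only, the parsing step may be sketched as follows, with a Python dictionary standing in for the Perl hash. The sketch assumes that the raw search output has already been reduced to a hypothetical tab-delimited form (protein, domain, E-value, start, end); this is not the native HMMER output format, and the protein and domain names are placeholders.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.0005  # cutoff stated in the text

def parse_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Collect domain hits per protein from simplified tab-delimited lines.

    Each line is assumed to hold: protein, domain, E-value, start, end.
    Hits with an E-value above the cutoff are discarded.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        protein, domain, evalue, start, end = line.split("\t")
        evalue = float(evalue)
        if evalue <= cutoff:
            hits[protein].append(
                {"domain": domain, "evalue": evalue,
                 "start": int(start), "end": int(end)}
            )
    return dict(hits)

# Hypothetical input lines.
sample = [
    "# protein\tdomain\tevalue\tstart\tend",
    "protX\tdomain1\t1e-10\t5\t90",
    "protX\tdomain2\t0.01\t100\t150",   # above the cutoff, dropped
    "protY\tdomain3\t2e-6\t1\t250",
]
for protein, domains in parse_hits(sample).items():
    print(protein, [d["domain"] for d in domains])
```

The resulting dictionary could then be serialized to disk (for example with the standard `json` module) to play the role of the persisted hash.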

[0070] Protein links based on multiple genomes. HTS allows an increasing number of genomes to be sequenced. Given the assumption that genes present in the genomes of multiple species share similar evolutionary histories (phylogeny) and might therefore share similar functions, Eisenberg and colleagues proposed to infer potential protein-protein links from genes with similar phylogenetic profiles (D. Eisenberg et al., Protein Function in the Post-Genomic Era, Nature, vol. 405, 2000, pp. 823-826). Approximately 50 completed genomes from the Institute for Genomic Research and the Sanger Center were used to generate phylogenetic profiles (a vector with length of 50) for each gene. The vector value is the actual E-value obtained from the Basic Local Alignment Search Tool (BLAST) search.

[0071] c) Data Characterization

[0072] Separate individual operational relational databases are constructed to store the raw information from each of the above dimensions. There is no limit to the number of sources or dimensions which may be relied upon, although typically from five to twenty, preferably six to ten, are used. These databases provide the conventional query and search based on a single type of data source.

[0073] In addition to the raw data storage, each source is also converted into a numerical matrix. The result is what is called a characteristic matrix. Rows stand for genes of interest, and columns represent a particular attribute within that dimension. The element of the matrix is the value of each gene's fit to that attribute.

[0074] The numerical matrix may be generated for each individual data source as follows.

[0075] For the functional assay, the row is the gene and the column is the assay type/name. The matrix element is the degree of inhibition or activation for each functional assay.

[0076] Specific protein-protein interactions are shown as a blue color assay in the YTH screening experiment. The binding affinity degree is represented as one of four levels: strong, medium, weak, and none, corresponding to 1, 0.8, 0.6 and 0 in the binding matrix. The level of protein-protein interaction extracted from the proteomics experiment is based on eight internal experimental parameters. The row is the bait gene, the column is the hit gene. The element is the binding affinity described above.
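By way of illustration only, the binding matrix may be assembled as in the following sketch. The four affinity levels and their numerical values are those stated above; the function name, the gene identifiers and the observations are hypothetical, and unobserved bait/hit pairs are assumed to default to “none”.

```python
# Affinity levels and scores as stated in the text.
AFFINITY_SCORE = {"strong": 1.0, "medium": 0.8, "weak": 0.6, "none": 0.0}

def binding_matrix(baits, hits, observations):
    """Build a bait-by-hit matrix of binding affinity scores.

    `observations` is an iterable of (bait, hit, level) tuples; any pair
    that was not observed keeps the score for "none" (an assumption).
    """
    row = {gene: i for i, gene in enumerate(baits)}
    col = {gene: j for j, gene in enumerate(hits)}
    matrix = [[AFFINITY_SCORE["none"]] * len(hits) for _ in baits]
    for bait, hit, level in observations:
        matrix[row[bait]][col[hit]] = AFFINITY_SCORE[level]
    return matrix

# Hypothetical screening results.
baits = ["bait1", "bait2"]
hits = ["hitA", "hitB", "hitC"]
observations = [
    ("bait1", "hitA", "strong"),
    ("bait1", "hitC", "weak"),
    ("bait2", "hitB", "medium"),
]
for gene, scores in zip(baits, binding_matrix(baits, hits, observations)):
    print(gene, scores)
```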

[0077] For ontology mapping, a high-dimension word space is constructed by extracting the description words for each GO node and merging them with the key words parsed from the Medline title and abstract of each gene. Each unique key word is one dimension of the word space. The distances between each gene and the GO nodes are calculated and ranked by the Euclidean distance in the word space. The gene's definitive GO mapping nodes are either manually selected or the top five ranked GO nodes are chosen to represent the relationship between the gene of interest and the GO vocabulary. The row of the matrix is the gene, the column is the unique description word from GO and the element is the mapping described above.
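By way of illustration only, the ranking of GO nodes by Euclidean distance in the word space may be sketched as follows. The sketch assumes raw word counts as coordinates and whitespace tokenization; the text above does not specify either, and the node names and descriptions are hypothetical placeholders rather than actual GO entries.

```python
import math
from collections import Counter

def word_vector(text, vocabulary):
    """Count occurrences of each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def rank_go_nodes(gene_text, go_descriptions, top_n=5):
    """Rank GO nodes by Euclidean distance to the gene in the word space.

    The word space has one dimension per unique word drawn from the gene's
    key words and all node descriptions. Ties are broken by node name.
    """
    words = set(gene_text.lower().split())
    for description in go_descriptions.values():
        words.update(description.lower().split())
    vocabulary = sorted(words)
    gene_vec = word_vector(gene_text, vocabulary)
    scored = sorted(
        (math.dist(gene_vec, word_vector(desc, vocabulary)), node)
        for node, desc in go_descriptions.items()
    )
    return [node for _, node in scored[:top_n]]

# Hypothetical node descriptions and gene key words.
go_descriptions = {
    "node_apoptosis": "induction of apoptosis",
    "node_replication": "dna replication",
    "node_arrest": "cell cycle arrest",
}
gene_key_words = "apoptosis induction by tnf"
print(rank_go_nodes(gene_key_words, go_descriptions, top_n=2))
```

In practice, stop-word removal or term weighting would likely be applied before computing distances; those refinements are omitted here.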

[0078] For fold recognition, the row of the matrix is the gene and the column is a protein with a crystal structure. The element is either zero, if there is no match between the gene of interest and the crystal structure, or the actual confidence score output by the neural network described in the previous section.

[0079] For the co-regulation value from the expression array, the row of the matrix is the gene and the column is the condition of the experiment performed. The element is the logarithm of the ratio of expression levels of a particular gene under two different experimental conditions.

[0080] For domain analysis, the row of the matrix is the gene and the column is the domain name obtained from Pfam. The element is the negative logarithm of the E-value obtained from the HMM search.

[0081] For protein links based on multiple genomes, the row of the matrix is the gene and the column is the name of the complete genome. The element is the actual negative logarithm of the lowest E-value obtained from the BLAST search queried by the gene against that particular complete genome.
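The E-value transformation used for the domain and multiple-genome matrices may be sketched as follows. The logarithm base is not stated above; base 10 is assumed here. The floor applied to E-values reported as zero and the value 0.0 for genomes without any hit are likewise assumptions, and the genome names are hypothetical.

```python
import math

def neg_log_evalue(e_value, floor=1e-300):
    """Negative base-10 logarithm of an E-value (base is an assumption).

    E-values reported as 0 are clamped to `floor` to keep the log finite.
    """
    return -math.log10(max(e_value, floor))

def genome_profile(blast_hits, genomes):
    """Build one matrix row: one element per complete genome.

    `blast_hits` maps a genome name to the E-values of all hits found when
    the gene is queried against that genome; the lowest E-value is used.
    """
    row = []
    for genome in genomes:
        e_values = blast_hits.get(genome, [])
        row.append(neg_log_evalue(min(e_values)) if e_values else 0.0)
    return row

profile = genome_profile(
    {"genome_1": [1e-5, 1e-20], "genome_2": [1e-3]},
    ["genome_1", "genome_2", "genome_3"],
)
```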

[0082] Each characteristic matrix may be processed individually or by concatenating them into a larger matrix with multiple metric measurements as a part of the multidimensional analysis. Optionally, each matrix may be weighted using a variety of techniques. In one embodiment, each data source is equally weighted, i.e., with an equal weight of 1. In another embodiment, based on prior knowledge, the weight is decreased for data sources with higher false positive rates or less biological significance and/or increased for more reliable or significant data. In a third embodiment, Bayesian statistics may be used to calculate the conditional posterior probability of the weight for each source, based on the likelihood of observing the data given the data source, as follows:

$P(S \mid D) = \frac{P(D \mid S)\,P(S)}{\sum_{S'} P(D \mid S')\,P(S')}$

[0083] where S is the data source and D is the characteristic value of each gene for that particular data source. P(D|S) is the likelihood of observing the data given this particular data source, and P(S) is the prior probability of the data source, which can be uniformly distributed in the beginning.
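The Bayesian weighting and the weighted concatenation of characteristic matrices may be sketched as follows. This is an illustration under stated assumptions: the likelihood values, the source names, and both function names are hypothetical, and how P(D|S) is estimated in practice is not specified above.

```python
def posterior_weights(likelihoods, priors=None):
    """Posterior P(S|D) for each source from P(D|S) and P(S) (Bayes' rule).

    With no priors given, a uniform prior over the sources is assumed.
    """
    sources = list(likelihoods)
    if priors is None:
        priors = {s: 1.0 / len(sources) for s in sources}
    joint = {s: likelihoods[s] * priors[s] for s in sources}
    evidence = sum(joint.values())  # denominator: sum over all sources
    return {s: joint[s] / evidence for s in sources}

def concatenate_weighted(matrices, weights):
    """Scale each characteristic matrix by its weight and join the columns.

    All matrices are assumed to share the same gene (row) order.
    """
    combined = [[] for _ in matrices[0]]
    for matrix, weight in zip(matrices, weights):
        for i, row in enumerate(matrix):
            combined[i].extend(weight * value for value in row)
    return combined

# Hypothetical likelihoods for three sources, uniform prior.
weights = posterior_weights({"yth": 0.6, "domain": 0.3, "expression": 0.1})
merged = concatenate_weighted([[[1.0, 2.0]], [[4.0]]], [1.0, 0.5])
```

With a uniform prior, the posterior weights are simply the likelihoods normalized to sum to one.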

[0084] d) Data Integration

[0085] The data warehouse-based infrastructure supports bioinformatic analysis of potential molecular therapeutic targets. Unlike traditional relational database management systems (RDBMSs), which are optimized for transactional database processing, the data warehouse of this invention is an integrated, time-variant, nonvolatile collection of data from various operational databases. In this bioinformatics system, operational relational databases contain data generated from specific methodologies, as discussed above.

[0086] A gene-centric analysis of the data warehouse allows large-scale data integration, relationship learning, and decision-making based on daily updated operational relational databases. The entire system serves two primary purposes:

[0087] It organizes existing data to facilitate complex queries.

[0088] It infers relationships based on the stored data and subsequently predicts missing attribute values for incoming information based on multidimensional data.

[0089] Original data from a plurality of dimensions, or independent sources, is entered (i.e., parsed) into operational databases and updated daily. The operational databases are based on the relational database system and are tuned to support large-scale and frequent transactions. Data analysis engines extract, clean, and analyze the data from these databases and load the analyzed data and metadata into the data warehouse. The core data warehouse system design is based on the star schema of Oracle 8i, which implements multidimensional databases.

[0090] Data marts (extensions of the data warehouse) support different query requests. Data marts derive data from the central data warehouse in response to different inquiries (i.e., queries). A detailed-summary system is a data mart built on the traditional RDBMS design with fixed metadata tables to summarize and aggregate raw data. Another type of data mart is based on a multidimensional database design. The latter offers superior representations of diverse data views, which it obtains by comparing various aspects of the analysis environment at different detail levels.

[0091] The star schema uses de-normalized storage to provide data views from individual or multiple dimensions with high efficiency. It offers multidimensional solutions that analyze large amounts of data with very fast response times, “slices and dices” through the data, and drills down or rolls up through various dimensions defined by the data structure. It is also easy to scale.

[0092] In order to easily visualize data from multiple views, the system is limited to a 3D data cube. In one example, genes discovered in the functional assays of two projects, a T and B lymphocyte cell project and a cell-cycle project, are used to construct a data cube as a function of time and each gene's domains. As FIG. 2 shows, a cube slice represents a data view as a function of the various dimensions. Regular normalized RDBMS design requires expensive multiple table join operations to extract the information in the data cube, whereas the multidimensional database exploits the schema design to facilitate a faster and more reliable response.
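The slice and roll-up operations on a 3D data cube may be illustrated conceptually as follows. This is an in-memory sketch only and does not represent the star schema implementation; the (gene, time, domain) axis order, the function names, and the fact values are assumptions made for illustration.

```python
from collections import defaultdict

def build_cube(facts):
    """Aggregate (gene, time, domain, value) facts into a 3D cube.

    The cube is a dict keyed by (gene, time, domain) coordinates.
    """
    cube = defaultdict(float)
    for gene, time, domain, value in facts:
        cube[(gene, time, domain)] += value
    return dict(cube)

def slice_cube(cube, axis, key):
    """Fix one dimension to `key` and return the remaining 2D view."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == key}

def roll_up(cube, axis):
    """Sum the cube over one dimension, removing it from the view."""
    totals = defaultdict(float)
    for k, v in cube.items():
        totals[k[:axis] + k[axis + 1:]] += v
    return dict(totals)

# Hypothetical facts, for illustration only.
cube = build_cube([
    ("gene_1", "day_1", "death_domain", 1.0),
    ("gene_1", "day_2", "death_domain", 2.0),
    ("gene_2", "day_1", "TRAF_domain", 3.0),
])
day_1_view = slice_cube(cube, axis=1, key="day_1")
over_time = roll_up(cube, axis=1)
```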

[0093] e) Data Analysis and Relationship Inference

[0094] There are numerous methods known in the art to explore relationship extraction and inference deduction. In one embodiment, the system of this invention uses cluster analysis, a multivariate analysis technique that seeks to organize information about variables to form relatively homogeneous groups. Some common cluster analysis methods are hierarchical clustering, k-means, self-organizing maps, and support vector machines. All of these methods employ distance functions to compare raw data and recognize grouping characteristics.

[0095] Hierarchical clustering may be applied to the characteristic matrices. Hierarchical clustering offers flexibility in determining the exact cluster number and statistical assessment of members' relatedness along the branching tree (M. B. Eisen et al., Cluster Analysis and Display of Genome-Wide Expression Patterns, Proc. Nat'l Academy of Science, USA 95, Nat'l Academy of Sciences, Washington D.C., 1998, pp. 14,863-14,868). The distance matrix is based on the Pearson correlation coefficient between any two genes X and Y from the original characteristic matrix:

$\rho_{X,Y} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_{i} - \bar{X}}{\sigma_{X}}\right)\left(\frac{Y_{i} - \bar{Y}}{\sigma_{Y}}\right) \qquad (1)$

[0096] where N is the characteristic matrix dimension in the attribute direction, and σ is the standard deviation of the gene attribute:

$\sigma_{A} = \sqrt{\sum_{i=1}^{N}\frac{\left(A_{i} - \bar{A}\right)^{2}}{N}} \qquad (2)$
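Equations (1) and (2) may be computed directly, as in the following sketch. The function names are introduced here for illustration; a gene whose attribute vector is constant has zero standard deviation and is not handled by this sketch.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation, dividing by N as in equation (2)."""
    m = mean(values)
    return math.sqrt(sum((a - m) ** 2 for a in values) / len(values))

def pearson(x, y):
    """Pearson correlation coefficient of two gene vectors, equation (1)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = std(x), std(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n

r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # perfectly correlated
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
```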

[0097] Clustering is achieved by recursively joining the two elements with the highest Pearson correlation coefficient in the upper-diagonal distance matrix until the distance matrix dimension reduces to one in the direction that the joining is performed. This process leads to clustering of genes with a similar binding vector profile. A tree-based dendrogram is produced with end nodes of relatively high correlation to each other, and nodes in the upper branch with less correlation to each other. The clustering algorithm uses an unsupervised hierarchical clustering that discovers common characteristics among genes without prior knowledge of them. As knowledge accumulates, it may be helpful to use a set of truly related genes as the training set for a relationship inference model.
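The recursive joining procedure may be sketched as follows. How a joined element is represented is not stated above; this sketch assumes it is replaced by the size-weighted average of its members' vectors. The gene names and profiles are hypothetical, and constant vectors (zero standard deviation) are not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation with population standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def hierarchical_cluster(profiles):
    """Join the most correlated pair repeatedly until one node remains.

    Returns (tree, merge_correlations); the tree is a nested tuple of names.
    """
    # Each node holds (subtree, representative vector, number of genes).
    nodes = [(name, list(vec), 1) for name, vec in profiles.items()]
    merge_correlations = []
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):  # upper triangle only
                r = pearson(nodes[i][1], nodes[j][1])
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        (tree_i, vec_i, n_i), (tree_j, vec_j, n_j) = nodes[i], nodes[j]
        # Assumption: the joined node is the size-weighted average profile.
        joined = [(a * n_i + b * n_j) / (n_i + n_j) for a, b in zip(vec_i, vec_j)]
        nodes = [node for k, node in enumerate(nodes) if k not in (i, j)]
        nodes.append(((tree_i, tree_j), joined, n_i + n_j))
        merge_correlations.append(r)
    return nodes[0][0], merge_correlations

tree, merges = hierarchical_cluster({
    "g1": [1.0, 0.8, 0.0, 0.0],
    "g2": [1.0, 0.6, 0.0, 0.0],
    "g3": [0.0, 0.0, 1.0, 0.8],
    "g4": [0.0, 0.6, 1.0, 0.0],
})
```

The correlations recorded at each join decrease toward the root of the tree, which mirrors the dendrogram property described above.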

[0098] Support vector machines (SVM), a class of supervised clustering algorithms, have been successfully applied to functional classification when combining data from both array expression experiments and phylogenetic profiles (P. Pavlidis et al., Gene Functional Classification from Heterogeneous Data, Proc. 5^(th) Int'l Conf. Computational Molecular Biology, ACM Press, New York, 2001, pp. 242-248). SVM projects the original data into a higher dimensional space, the feature space, and defines a separating hyperplane to discriminate class members from nonmembers, an operation difficult to perform in the original space. SVM does not require all clusters to have spherical contours, an underlying assumption of many unsupervised clustering algorithms, such as k-means.

[0099] For supervised learning, a support vector machine with an inner product kernel raised to a certain power (n=1, 2, 3) may be used

$K(\vec{X},\vec{Y}) = \left(\frac{\vec{X}\cdot\vec{Y}}{\sqrt{\vec{X}\cdot\vec{X}}\,\sqrt{\vec{Y}\cdot\vec{Y}}} + 1\right)^{n}$

[0100] or a radial basis kernel and a 2-norm soft margin

$K(\vec{X},\vec{Y}) = \exp\left(-\frac{\lVert\vec{X} - \vec{Y}\rVert^{2}}{2\sigma^{2}}\right)$

[0101] where $\vec{X}$ is the vector from the matrix that describes gene X.
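The two kernel functions may be evaluated as in the following sketch. The function names and default parameter values are introduced here for illustration; the sketch covers the kernels only and does not train a support vector machine.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalized_polynomial_kernel(x, y, n=2):
    """Normalized inner product plus one, raised to the power n."""
    cosine = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
    return (cosine + 1.0) ** n

def radial_basis_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); sigma is the kernel width."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

k_identical = normalized_polynomial_kernel([1.0, 0.0], [1.0, 0.0], n=2)
k_orthogonal = normalized_polynomial_kernel([1.0, 0.0], [0.0, 1.0], n=2)
k_rbf = radial_basis_kernel([0.0, 0.0], [1.0, 1.0], sigma=1.0)
```

Because the inner product is normalized by the vector lengths, the first kernel depends only on the angle between the two gene vectors, not on their magnitudes.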

[0102] For supervised learning, feature selection is performed to extract the genes with the highest F score defined by the Fisher criterion to make the classification more accurate and informative:

$F(j) = \frac{\left(\mu_{j}^{+} - \mu_{j}^{-}\right)^{2}}{\left(\sigma_{j}^{+}\right)^{2} + \left(\sigma_{j}^{-}\right)^{2}}$

[0103] where $\mu_{j}^{+}$ and $\sigma_{j}^{+}$ are the mean and standard deviation of that feature across the positive examples, and $\mu_{j}^{-}$ and $\sigma_{j}^{-}$ are from the negative examples.
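The Fisher criterion may be computed per feature as in the following sketch. The population form of the standard deviation (dividing by N) is assumed here, the function names are hypothetical, and a feature with zero variance in both classes is not handled.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance (dividing by N), an assumption of this sketch."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def fisher_score(positive, negative):
    """F = (mu+ - mu-)^2 / ((sigma+)^2 + (sigma-)^2) for one feature."""
    return (mean(positive) - mean(negative)) ** 2 / (variance(positive) + variance(negative))

def rank_features(positive_rows, negative_rows):
    """Return feature indices sorted from highest to lowest F score."""
    n_features = len(positive_rows[0])
    scores = [
        fisher_score([row[j] for row in positive_rows],
                     [row[j] for row in negative_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True), scores

# Two positive and two negative examples with two features each (made-up values).
order, scores = rank_features([[1.0, 5.0], [3.0, 5.5]], [[2.0, 1.0], [2.5, 1.5]])
```

In this example the second feature separates the two classes far better than the first and is therefore ranked first.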

[0104] After clustering or other classification, genes with similar vector descriptions within the matrix may be correlated as demonstrated in FIGS. 5 and 6. Genes and proteins with similar profiles based on the various biological experiments and observations described above are more likely to have similar cellular functions and be involved in the same biological pathway.

[0105] Genes and proteins having similar functions or participating in the same biological pathway are clustered/classified together. Therefore, functions of novel genes and proteins can be inferred based on functions of the known genes or proteins in the same cluster.
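The inference step may be sketched as follows. The rule used here, assigning to each unannotated member the most frequent annotation among the known members of its cluster, is one simple possibility assumed for illustration; the annotation labels are hypothetical.

```python
from collections import Counter

def infer_functions(clusters, known_functions):
    """Propose a function for each unannotated gene from its cluster.

    `clusters` is a list of gene-name lists; `known_functions` maps
    annotated genes to a function label. Clusters without any annotated
    member yield no prediction.
    """
    inferred = {}
    for members in clusters:
        labels = Counter(known_functions[g] for g in members if g in known_functions)
        if not labels:
            continue
        most_common_label = labels.most_common(1)[0][0]
        for gene in members:
            if gene not in known_functions:
                inferred[gene] = most_common_label
    return inferred

# Illustrative labels only; the cluster membership is hypothetical input.
predictions = infer_functions(
    [["TRAF1", "TRAF2", "RIP", "BAB25712.1"], ["gene_x", "gene_y"]],
    {"TRAF1": "TNF signaling", "TRAF2": "TNF signaling", "RIP": "TNF signaling"},
)
```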

[0106] The system effectively blends all available information to establish functional linkages among proteins. This information may be used to identify genes that are likely drug target candidates. Small molecule-based high throughput screening (HTS) of the validated targets may be used to identify lead compounds. The lead compounds may then be subjected to downstream biochemical and cell-based assays for optimization. After confirming the functional specificity and activity of the optimized lead compounds, their drug effects may be further characterized in animal models and preclinical studies.

EXAMPLE

[0107] Clustering methods were used to analyze genes from YTH screening. The original data set contained about 300 baits and 2,300 hits. A total of 391 baits and hits with multiple interactions were selected for the analysis shown in FIG. 3. FIG. 4 shows the protein-protein interaction network prior to gene clustering.

[0108] After clustering the first characteristic matrix, proteins that have similar functions or participate in the same pathway were grouped. As shown in FIG. 5, FLAME1-γ (AF009618), Homo sapiens FLAME1 mRNA (AF009616), MRIT-α-1 (U85059), I-FLICE (AF041458), and CLARP (AF005774.1) all encode the same protein, which is commonly referred to as FLIP. FLIP is a death-domain-containing anti-apoptotic molecule that regulates Fas/TNFR1-induced apoptosis.

[0109] Another group close to FLIP in the hierarchical tree contains TRAF1 (TNF receptor associated factor 1, NM005658), TRAF2 (U12597), TANK (TRAF family member associated NFKB activator, XP_002533.1), receptor interacting protein (RIP), RIP-like kinase (AF156884), BCL-2-associated athanogene (XP_005538.1), protein phosphatase 2 regulatory subunit B (NP_006234.1), and an unknown gene (BAB25712.1).

[0110] The literature has shown that TNF receptors lack intrinsic catalytic activity. Death-domain-containing proteins (RIP) and TRAF-domain-containing proteins (TRAF1, TRAF2, TRAF3) bridge TNF receptors to several downstream signaling pathways. This bridging causes diverse cellular responses including cellular proliferation, differentiation, effector functioning, and apoptosis. Protein phosphatase 2A (PP2A) affects a variety of biological events including apoptosis.

[0111] BAD is a pro-apoptotic member of the BCL2 family of proteins. PP2A can dephosphorylate BAD, resulting in apoptosis. BCL-2-associated athanogene (BAG-1) is a heat shock protein 70 (Hsp70)-binding protein that can collaborate with BCL-2 to enhance the anti-apoptotic activity of BCL-2.

[0112] The literature confirms that these proteins are all involved in apoptotic and anti-apoptotic signaling events initiated by TNF and BCL-2. The unknown protein BAB25712.1 might therefore play a role in TNF or BCL-2 signaling pathways. Our system placed FLIP in the vicinity of RIP and the BCL-2-associated chaperone BAG-1. Several independent studies support this protein linkage assignment. FLIP can modulate the NFkappaB pathway and physically interacts with several signaling proteins, such as the TRAFs and RIP. FLIP can also interact with a BCL-2 family member.

[0113] FIG. 6 shows the significant refinement in results when HMM domain search information was incorporated as an additional dimension. First, more proteins involved in TNF receptor-induced apoptosis join the group. Both CAP-1 (CD40-associated protein, L38509) and CRAF-1 (CD40 receptor-associated factor 1, U21092) sequences are essentially identical to TRAF3 by BLAST analysis (in other words, TRAF3 is included in this pathway).

[0114] TRAF3, which contains a conserved TRAF domain, binds to the CD40 (a member of the TNF receptor family) intracellular domain. Two other TRAF family members, TRAF4 and TRAF6, were included in this group. ILP (IAP-like protein, U32974), MIHC (human homolog of IAP C, U37546), and p73 (AF079094) were also found to be new members.

[0115] Baculovirus inhibitors of apoptosis (IAPs) can prevent insect cell death. Both ILP and MIHC are human homologs of IAP. Rothe and colleagues have shown that interactions of MIHC with TRAF1 and TRAF2 inhibit apoptosis. Similarly, ILP can regulate cell death. P73 is a p53-like tumor-suppressor protein which regulates the cell cycle and apoptosis. Direct interaction of p73 with IAP-like proteins or TRAFs has not been reported, however. The inference analysis suggests that the apoptosis pathways of p73 and IAP-like proteins intersect or even associate physically.

[0116] We excluded genes specific to the BCL-2 apoptosis pathway (the PP2A B subunit and BCL-2-associated athanogene), as the current apoptotic pathway is more TNF specific. Finally, we also excluded two related proteins (RIP and RIPH) after we added domain information. This was done with the YTH interaction pattern and domain information weighted equally. When we tuned the domain weight up to five times the weight of YTH, RIP and RIPH were retained, but some unrelated proteins were also introduced.

[0117] Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.

[0118] Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented only in hardware, or only in software, or using combinations thereof.

[0119] The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

We claim:
 1. A method for identifying attributes of biomolecules via the manipulation of biological data from a plurality of experimental and archival sources, comprising: collecting experimental data from one or more of a plurality of screening experiments; collecting archival biological data from one or more of a plurality of public or proprietary resources; storing said collected data in a database; converting the stored data into a format that facilitates mathematical analysis; and analyzing the stored data to determine attributes of biological molecules.
 2. The method of claim 1 in which potential therapeutic targets are identified from the attributes.
 3. The method of claim 2 further comprising: screening said targets for interactions with members of compound libraries.
 4. The method of claim 1 wherein the experimental data comprises: data from high throughput screening, yeast two-hybrid screening, protein array analyses, nucleic acid array analyses, siRNA analyses, and functional gene screening.
 5. The method of claim 1 wherein the archival data comprises: domain analyses, ontology vocabulary mapping, fold recognition, gene sequences, protein sequences, expressed sequence tags, single nucleotide polymorphisms, biochemical functions, physiological roles and structure/function relationships from web-based or conventional published literature.
 6. The method of claim 1 in which the database is a dynamic, intelligent, time variant database.
 7. The method of claim 1 in which the data are cleaned or weighted prior to integration.
 8. The method of claim 1 in which the correlation analysis comprises cluster analysis.
 9. The method of claim 8 in which the cluster analysis comprises unsupervised clustering analysis.
 10. The method of claim 9 in which the clustering analysis comprises hierarchical, K-mean, or self-organizing map algorithms.
 11. The method of claim 8 in which the cluster analysis comprises supervised cluster analysis.
 12. The method of claim 11 in which the cluster analysis comprises support vector machines or neural nets.
 13. A method for detecting protein-protein interactions via the analysis of biodata from a plurality of experimental and archival sources, comprising: collecting experimental data from one or more of a plurality of screening experiments; collecting archival biological data from one or more of a plurality of public or proprietary resources; storing said collected data in a database; converting the stored data into a format that facilitates mathematical analysis; and analyzing the data to identify interactions between known and unknown proteins.
 14. A computer-based system for the analysis of biodata from a plurality of experimental and archival sources, comprising: a first dynamic memory for receiving and storing experimental data from one or more of a plurality of screening experiments; a second dynamic memory for receiving and storing archival biological data; an operational relational database for manipulating said data; and processing means for analyzing said data to identify correlations between known and unknown biomolecules and to select potential therapeutic targets based upon the relationship correlations.
 15. The method of claim 1 in which the analysis is implemented via population and manipulation of matrices.